What are meaningful measures of engagement that S2M can use to predict longer tenure on the app? If we can define behavioral patterns predictive of long-term usage, S2M can identify users most likely to be shorter-term users and target them with nudges to prolong or increase their engagement. Through the descriptive and cohort analyses, we discovered that there is no notable difference between short- and longer-term users with respect to the number of stories viewed on the first day, the number of types of story, and which app feature is used to access stories. As seen in Figure YY, the rate of decrease in app usage over time is similar across groups of users with varying levels of first-week usage. However, further analysis uncovered that weeks 8-10 are an important inflection point for user churn. To develop a successful nudging strategy, it is necessary to identify the users who are likely to have less than 10 weeks of tenure.
We will use the output created from the R Markdown file titled master_clustering.Rmd.
filename = "clustering_features_wk9added"
full_child_data_set <- read.csv(paste(filename, ".csv", sep = ""), header = T, na.strings = "?")

In this section, we define the response variable we want to predict and select the variables we want as predictors. For S2M to predict a different output or select different inputs, the values within the quotes need to be changed.
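As a concrete sketch of these definitions (the feature names are taken from the model output later in this file, but the response column name here is an assumption and must be adjusted to match the actual CSV header):

```r
# Response: days of tenure since signup (column name assumed; adjust to the CSV)
y_response <- "n_days_in_freadom"

# Predictors: first-three-weeks engagement features, as used by the models below
x_features <- c("avg_n_stories_first3wkly_sessions",
                "n_sessions_first3wks",
                "unique_sources_first3wks")

# The modeling code below operates on child_data_set
child_data_set <- full_child_data_set[, c(y_response, x_features)]
```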
We selected a 60-20-20 split for the training, validation, and test sets, respectively. For S2M to try different percentages, the split probabilities need to be changed, but they must always sum to 1. To produce replicable results we set the seed 2021; this could be removed for a real analysis.
# Splitting data into train, validation & test sets
set.seed(2021)
# Set response
child_Y <- full_child_data_set[y_response]
# Set training features
child_X <- subset(full_child_data_set, select = c(x_features))
# Min-max scale each feature to [0, 1]
child_S <- apply(child_X, MARGIN = 2, FUN = function(X) (X - min(X))/diff(range(X)))
child_data_set <- cbind(child_Y, child_S)
# Split Shares:
# Training Set - 60%
# Validation Set - 20%
# Test Set - 20%
splits <- sample(1:3, size = nrow(child_data_set), replace = TRUE, prob = c(0.6, 0.2, 0.2))
# Set creation (labels 2 and 3 both carry probability 0.2, so using
# label 3 for validation and label 2 for test is an arbitrary but consistent choice)
child_Train <- child_data_set[splits == 1, ]
child_CV <- child_data_set[splits == 3, ]
child_Test <- child_data_set[splits == 2, ]
# Get Training Set
child_Train_Y = child_Train[y_response]
child_Train_X = subset(child_Train, select = c(x_features))
# Get Validation Set
child_CV_Y = child_CV[y_response]
child_CV_X = subset(child_CV, select = c(x_features))
# Get Test Set
child_Test_Y = child_Test[y_response]
child_Test_X = subset(child_Test, select = c(x_features))

Through the clustering and data analysis, we identified the Average Number of Stories, Number of Sessions, and Unique Sources to be the main predictors driving long-term tenure; these acted as our features, while the Number of Days in Freadom since Signup served as the response. The time window to use in the features was left as a hyperparameter. We developed a supervised learning model to predict tenure; to evaluate the performance of different approaches, we used the Root Mean Squared Error (RMSE) as our metric, which measures how far the predictions are from the true values on average.
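The RMSE metric described here can be written as a one-line helper; this function is not part of the original file and is shown only to make the metric explicit:

```r
# Root Mean Squared Error: average distance between predictions and true values
rmse <- function(pred, true) sqrt(mean((pred - true) ^ 2))

rmse(c(1, 2, 3), c(1, 2, 5))  # = sqrt(4/3), about 1.155
```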
We use the randomForest library to fit a regression-tree Bootstrap Aggregation (bagging) model by specifying that all variables (mtry = 3) are considered at each split. We will plot the out-of-bag (OOB) error to see how the prediction error changes with the number of trees created.
library(randomForest)
# Fit a bagged model (a random forest with mtry = number of predictors)
set.seed(2021)
bag.model = randomForest(child_Train_Y[,]~ ., data = child_Train_X, mtry = 3, importance = TRUE)
bag.model
##
## Call:
## randomForest(formula = child_Train_Y[, ] ~ ., data = child_Train_X, mtry = 3, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 3198.009
## % Var explained: -2.58
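The OOB error curve mentioned above can be drawn directly from the fitted object; the plot method for randomForest objects shows the running OOB MSE against the number of trees:

```r
# OOB mean squared error as trees are added to the ensemble
plot(bag.model, main = "OOB Error vs Number of Trees")
```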
We then run predictions on the training and test sets and plot the true values against the predictions. The closer the points lie to an ascending 45-degree line, the lower the bias.
TrbagPred = predict(bag.model, child_Train_X)
TrbagMSE = mean((TrbagPred - child_Train_Y[, ]) ^ 2)
TrbagRMSE = sqrt(TrbagMSE)
paste("The RMSE of the Bagging model is:", TrbagRMSE)
## [1] "The RMSE of the Bagging model is: 47.6010917162449"
bagPred = predict(bag.model, child_Test_X)
bagMSE = mean((bagPred - child_Test_Y[, ]) ^ 2)
bagRMSE = sqrt(bagMSE)
paste("The RMSE of the Bagging model is:", bagRMSE)
## [1] "The RMSE of the Bagging model is: 57.1622189482128"
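The true-values-versus-predictions plot described above can be produced with a sketch like the following (axis labels are our choice), mirroring the plot used later for the linear model:

```r
par(pty = "s")
plot(child_Test_Y[, ], bagPred, ylab = "Bagging Prediction", xlab = "True Value")
abline(0, 1, lty = 2)  # 45-degree reference line
```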
To counter the high bias seen in the linear models, flexible tree-based methods can work. Individual trees suffer from high variance; to decrease the variance while maintaining a low bias, a tree-averaging method such as Bagging (Bootstrap Aggregation) achieves both. To decrease variance while keeping sampling costs equal, the Bagging method uses a powerful statistical device known as Bootstrapping: by taking repeated samples with replacement from within a pre-sampled set of observations, we synthetically obtain B samples and train a statistical model on each of them. Afterwards, we average over the predictions of all B models to obtain a final prediction.
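As a self-contained illustration of the bootstrap-aggregation idea on simulated data (the toy data and all names here are hypothetical, not part of the S2M pipeline):

```r
set.seed(2021)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

B <- 50  # number of bootstrap resamples
# Train one model per bootstrap sample, then average the B predictions
preds <- replicate(B, {
  boot <- d[sample(nrow(d), replace = TRUE), ]  # sample with replacement
  fit <- lm(y ~ x, data = boot)
  predict(fit, newdata = d)
})
bagged_pred <- rowMeans(preds)
```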
Based on the initial analysis we performed, S2M should use the Bagging method. In case S2M wants to explore further, below we present other methods we used.
library(tree)
full.tree.model = tree(child_Train_Y[, ]~ ., data = child_Train_X)
summary(full.tree.model)
##
## Regression tree:
## tree(formula = child_Train_Y[, ] ~ ., data = child_Train_X)
## Variables actually used in tree construction:
## [1] "n_sessions_first3wks" "avg_n_stories_first3wkly_sessions"
## Number of terminal nodes: 4
## Residual mean deviance: 2933 = 31050000 / 10590
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -71.50 -35.36 -22.36 0.00 19.87 221.60
Plot of the full tree.
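The plot of the full tree can be reproduced with the standard plotting calls from the tree package (a sketch; the original plotting code was not retained in this file):

```r
# Draw the fitted regression tree and annotate its splits
plot(full.tree.model)
text(full.tree.model, pretty = 0)
```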
Evaluating the CV error of the pruned trees and plotting their deviance (roughly interpreted as the sum of squared errors):
## Length Class Mode
## size 3 -none- numeric
## dev 3 -none- numeric
## k 3 -none- numeric
## method 1 -none- character
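The cross-validation and plotting code that produced the summary above was not retained in this file; a sketch using cv.tree would look like this:

```r
set.seed(2021)
cv.model = cv.tree(full.tree.model)  # CV deviance for each pruned subtree size
plot(cv.model$size, cv.model$dev, type = "b",
     xlab = "Tree Size", ylab = "Deviance")
summary(cv.model)
```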
Computing the RMSE of the full tree
TrtreePred = predict(full.tree.model, child_Train_X)
TrfullTreeMSE = mean((TrtreePred - child_Train_Y[, ]) ^ 2)
TrfullTreeRMSE = sqrt(TrfullTreeMSE)
paste("The RMSE of the full tree is :", TrfullTreeRMSE)
## [1] "The RMSE of the full tree is : 54.1445693427257"
treePred = predict(full.tree.model,child_Test_X)
fullTreeMSE = mean((treePred - child_Test_Y[, ]) ^ 2)
fullTreeRMSE = sqrt(fullTreeMSE)
paste("The RMSE of the full tree is :", fullTreeRMSE)
## [1] "The RMSE of the full tree is : 55.8302830049567"
set.seed(2021)
require(gbm)
gbm.model = gbm(child_Train_Y[, ]~., data = child_Train_X, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, cv.folds = 5)
summary(gbm.model)
## var rel.inf
## n_sessions_first3wks n_sessions_first3wks 47.590651
## avg_n_stories_first3wkly_sessions avg_n_stories_first3wkly_sessions 44.794212
## unique_sources_first3wks unique_sources_first3wks 7.615138
Number of trees that minimizes the CV error.
boosttrees = length(gbm.model$cv.error)
boostmin = which.min(gbm.model$cv.error)
plot(c(1:boosttrees), gbm.model$cv.error, main = "CV Error vs Number of Trees", xlab = "Number of Trees", ylab = "CV Error")
abline(v = boostmin, lty = 2, col = "blue")
paste("The minimizing number of trees is:", boostmin)
## [1] "The minimizing number of trees is: 1930"
Calculating the RMSE
TrboostPred = predict(gbm.model, child_Train_X)
## Using 1930 trees...
TrboostMSE = mean((TrboostPred - child_Train_Y[, ]) ^ 2)
TrboostRMSE = sqrt(TrboostMSE)
paste("The train RMSE of the Boosting model is:", TrboostRMSE)
## [1] "The train RMSE of the Boosting model is: 53.4835401890399"
boostPred = predict(gbm.model, child_Test_X)
## Using 1930 trees...
boostMSE = mean((boostPred - child_Test_Y[, ]) ^ 2)
boostRMSE = sqrt(boostMSE)
paste("The RMSE of the Boosting model is:", boostRMSE)
## [1] "The RMSE of the Boosting model is: 55.0622018237479"
library(randomForest)
# Fit a Random Forest (mtry = 2 of the 3 predictors tried at each split)
set.seed(2021)
rf.model = randomForest(child_Train_Y[,]~ ., data = child_Train_X, mtry = 2, importance = TRUE)
summary(rf.model)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 10590 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 10590 -none- numeric
## importance 6 -none- numeric
## importanceSD 3 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 10590 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
Calculating the RMSE
TrrfPred = predict(rf.model, child_Train_X)
TrrfMSE = mean((TrrfPred - child_Train_Y[, ]) ^ 2)
TrrfRMSE = sqrt(TrrfMSE)
paste("The RMSE of the Random Forest model is:", TrrfRMSE)
## [1] "The RMSE of the Random Forest model is: 48.3819323196901"
rfPred = predict(rf.model, child_Test_X)
rfMSE = mean((rfPred - child_Test_Y[, ]) ^ 2)
rfRMSE = sqrt(rfMSE)
paste("The RMSE of the Random Forest model is:", rfRMSE)
## [1] "The RMSE of the Random Forest model is: 56.0545478170325"
Fitting the linear regression baseline and calculating the RMSE
lm.model = lm(child_Train_Y[, ] ~ ., data = child_Train_X)
TrlinearPred = predict(lm.model, child_Train_X)
TrlinearMSE = mean((TrlinearPred - child_Train_Y[, ]) ^ 2)
TrlinearRMSE = sqrt(TrlinearMSE)
paste("The Training RMSE of the Linear Model is:", TrlinearRMSE)
## [1] "The Training RMSE of the Linear Model is: 55.1853499149659"
linearPred = predict(lm.model, child_Test_X)
linearMSE = mean((linearPred - child_Test_Y[, ]) ^ 2)
linearRMSE = sqrt(linearMSE)
paste("The RMSE of the Linear Model is:", linearRMSE)
## [1] "The RMSE of the Linear Model is: 56.9501433881364"
par(pty = "s")
plot(child_Test_Y[, ], linearPred, ylab = "Linear Regression Prediction", xlab = "True Value")

#Lasso
library(glmnet)
cv.out = cv.glmnet(x = as.matrix(child_Train_X), y = child_Train_Y[,], alpha = 1, nlambda = 100)
lasso.model = glmnet(x = as.matrix(child_Train_X), y = child_Train_Y[,], alpha = 1, lambda = cv.out$lambda.min)
lasso.model$beta # Inspect which coefficients are shrunk to zero
## 3 x 1 sparse Matrix of class "dgCMatrix"
## s0
## unique_sources_first3wks -27.56393
## n_sessions_first3wks 52.58239
## avg_n_stories_first3wkly_sessions 90.44266
Calculating the RMSE
TrlassoPred = predict(lasso.model, as.matrix(child_Train_X))
TrlassoMSE = mean((TrlassoPred - child_Train_Y[, ]) ^ 2)
TrlassoRMSE = sqrt(TrlassoMSE)
paste("The Train RMSE of the Lasso Model is:", TrlassoRMSE)
## [1] "The Train RMSE of the Lasso Model is: 55.1853787164738"
lassoPred = predict(lasso.model, as.matrix(child_Test_X))
lassoMSE = mean((lassoPred - child_Test_Y[, ]) ^ 2)
lassoRMSE = sqrt(lassoMSE)
paste("The RMSE of the Lasso Model is:", lassoRMSE)
## [1] "The RMSE of the Lasso Model is: 56.9500496048001"
#### SUMMARY OF ALL RMSE ####
## [1] "The Train RMSE of the Linear Regression : 55.1853499149659"
## [1] "The Train RMSE of the Lasso : 55.1853787164738"
## [1] "The Train RMSE of the Decision Tree : 54.1445693427257"
## [1] "The Train RMSE of the Boosting Model : 53.4835401890399"
## [1] "The Train RMSE of the Bagging Model : 47.6010917162449"
## [1] "The Train RMSE of the Random Forest : 48.3819323196901"
## [1] "The RMSE of the Linear Regression : 56.9501433881364"
## [1] "The RMSE of the Lasso : 56.9500496048001"
## [1] "The RMSE of the Decision Tree : 55.8302830049567"
## [1] "The RMSE of the Boosting Model : 55.0622018237479"
## [1] "The RMSE of the Bagging Model : 57.1622189482128"
## [1] "The RMSE of the Random Forest : 56.0545478170325"
Our proposal and implementation pipeline for S2M from this code file and model is as follows: